Integration Test


Combining TSL and LLM to Automate REST API Testing: A Comparative Study

Barradas, Thiago, Paes, Aline, Neves, Vânia de Oliveira

arXiv.org Artificial Intelligence

The effective execution of tests for REST APIs remains a considerable challenge for development teams, driven by the inherent complexity of distributed systems, the multitude of possible scenarios, and the limited time available for test design. Exhaustive testing of all input combinations is impractical, often resulting in undetected failures, high manual effort, and limited test coverage. To address these issues, we introduce RestTSLLM, an approach that uses Test Specification Language (TSL) in conjunction with Large Language Models (LLMs) to automate the generation of test cases for REST APIs. The approach targets two core challenges: the creation of test scenarios and the definition of appropriate input data. The proposed solution integrates prompt engineering techniques with an automated pipeline to evaluate various LLMs on their ability to generate tests from OpenAPI specifications. The evaluation focused on metrics such as success rate, test coverage, and mutation score, enabling a systematic comparison of model performance. The results indicate that the best-performing LLMs - Claude 3.5 Sonnet (Anthropic), Deepseek R1 (Deepseek), Qwen 2.5 32b (Alibaba), and Sabia 3 (Maritaca) - consistently produced robust and contextually coherent REST API tests. Among them, Claude 3.5 Sonnet outperformed all other models across every metric, emerging in this study as the most suitable model for this task. These findings highlight the potential of LLMs to automate the generation of tests based on API specifications.
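The paper's actual prompt templates are not reproduced here, but the core idea of turning an OpenAPI operation into an LLM prompt for TSL-style test design can be sketched as below. The prompt wording, the function name, and the sample spec fragment are all illustrative assumptions, not the authors' pipeline.

```python
import json

def build_test_prompt(openapi_spec: dict, path: str, method: str) -> str:
    """Build an LLM prompt asking for TSL-style test scenarios for one
    endpoint. The wording is illustrative, not the paper's template."""
    operation = openapi_spec["paths"][path][method]
    return (
        "You are a REST API test designer. Using TSL-style category "
        "partitioning, produce test scenarios (valid, boundary, invalid) "
        "for the endpoint below.\n\n"
        f"Endpoint: {method.upper()} {path}\n"
        f"Operation: {json.dumps(operation, indent=2)}"
    )

# Minimal OpenAPI fragment for demonstration purposes only.
spec = {
    "paths": {
        "/users/{id}": {
            "get": {
                "summary": "Fetch a user by id",
                "parameters": [
                    {"name": "id", "in": "path", "required": True,
                     "schema": {"type": "integer", "minimum": 1}}
                ],
            }
        }
    }
}

prompt = build_test_prompt(spec, "/users/{id}", "get")
```

In a full pipeline, the returned prompt would be sent to each candidate LLM and the generated tests scored on success rate, coverage, and mutation score, as the study describes.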


Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering

Ngo, Nghia Trung, Van Nguyen, Chien, Dernoncourt, Franck, Nguyen, Thien Huu

arXiv.org Artificial Intelligence

Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) in knowledge-intensive tasks such as those from the medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets for testing LLMs' ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs' reasoning processes to provide valuable insights and future directions for developing RAG systems in this critical medical domain.
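MedRGB evaluates real LLMs over four medical QA datasets; the toy harness below only illustrates the kind of scenario the benchmark probes, using a hypothetical question, a stub reader, and made-up documents in place of an actual model and corpus. It contrasts the standard retrieve-answer setting (evidence present) with a sufficiency/robustness check (evidence absent, distractor present), where a trustworthy reader should abstain.

```python
def stub_reader(question: str, docs: list[str]) -> str:
    """Toy stand-in for an LLM reader: answers only when the retrieved
    documents state the needed fact verbatim, otherwise abstains."""
    known_facts = {"What treats condition X?": "Drug A treats condition X."}
    fact = known_facts.get(question, "")
    if fact and any(fact in d for d in docs):
        return fact
    return "unknown"

question = "What treats condition X?"
evidence = ["Drug A treats condition X."]
distractor = ["Drug B treats condition X."]  # plausible misinformation

# Standard retrieve-answer setting: the evidence is retrieved.
answer_with_evidence = stub_reader(question, evidence)

# Sufficiency/robustness setting: only a distractor is retrieved,
# so a reliable system should abstain rather than guess.
answer_without_evidence = stub_reader(question, distractor)
```

The paper's finding is that real models often fail exactly this second case, answering confidently from noisy or misleading documents.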


Gradual Drift Detection in Process Models Using Conformance Metrics

Gallego-Fontenla, Victor, Vidal, Juan C., Lama, Manuel

arXiv.org Artificial Intelligence

Changes, planned or unexpected, are common during the execution of real-life processes. Detecting these changes is a must for optimizing the performance of organizations running such processes. Most state-of-the-art algorithms focus on the detection of sudden changes, leaving other types of change aside. In this paper, we focus on the automatic detection of gradual drifts, a special type of change in which the cases of two models overlap during a period of time. The proposed algorithm relies on conformance checking metrics to carry out the automatic detection of the changes, also performing a fully automatic classification of these changes as sudden or gradual. The approach has been validated with a synthetic dataset consisting of 120 logs with different distributions of changes, achieving better results in terms of detection and classification accuracy, delay, and change region overlapping than the main state-of-the-art algorithms.
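The abstract's idea of detecting a drift from conformance metrics and then classifying it as sudden or gradual can be sketched with a simple windowed comparison. This is a minimal stand-in, not the paper's algorithm: the thresholds, window size, and the synthetic fitness trace are all assumptions for illustration.

```python
def detect_drift(fitness, window=5, threshold=0.1):
    """Flag trace positions where mean conformance fitness drops by more
    than `threshold` between two adjacent windows (a simple stand-in for
    the paper's conformance-checking metrics)."""
    changes = []
    for i in range(window, len(fitness) - window):
        before = sum(fitness[i - window:i]) / window
        after = sum(fitness[i:i + window]) / window
        if before - after > threshold:
            changes.append(i)
    return changes

def classify(changes):
    """Group consecutive change points; a wide transition region suggests
    a gradual drift, an isolated point a sudden one."""
    runs = []
    for i in changes:
        if runs and i == runs[-1][-1] + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
    return ["gradual" if len(r) > 1 else "sudden" for r in runs]

# Synthetic fitness trace: stable, then a period where the old and new
# process variants overlap (gradual degradation), then stable again.
trace = [1.0] * 10 + [0.9, 0.8, 0.7, 0.6] + [0.5] * 10
kinds = classify(detect_drift(trace))
```

Because the degradation is spread over several positions rather than a single step, the change points form a contiguous run and the sketch labels the drift gradual.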


GitHub - aws/sagemaker-python-sdk: A library for training and deploying machine learning models on Amazon SageMaker

#artificialintelligence

SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker. With the SDK, you can train and deploy models using popular deep learning frameworks Apache MXNet and TensorFlow. You can also train and deploy models with Amazon algorithms, which are scalable implementations of core machine learning algorithms that are optimized for SageMaker and GPU training. If you have your own algorithms built into SageMaker compatible Docker containers, you can train and host models using these as well. For detailed documentation, including the API reference, see Read the Docs.
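A minimal sketch of the train-and-deploy flow the README describes, using the SDK's TensorFlow estimator. The role ARN, S3 path, instance types, and version strings are placeholders (check the SDK docs for currently supported framework versions), and a local `train.py` training script is assumed; running this requires AWS credentials and incurs costs.

```python
from sagemaker.tensorflow import TensorFlow

# Placeholders: supply your own execution role ARN and S3 data location.
estimator = TensorFlow(
    entry_point="train.py",          # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.13",
    py_version="py310",
)

# Launches a managed training job on SageMaker.
estimator.fit("s3://my-bucket/training-data")

# Hosts the trained model behind a real-time endpoint.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")
```

The same pattern applies to the other framework estimators and to Amazon's built-in algorithms mentioned above.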


What is MLOps?

#artificialintelligence

Ever liked something on Instagram and then, almost immediately, seen related content in your feed? Or searched for something on Google and then been spammed with ads for that exact thing moments later? These are symptoms of an increasingly automated world. Behind the scenes, they are the result of state-of-the-art MLOps pipelines. We take a look at MLOps and what it takes to deploy machine learning models effectively. We start by discussing some key aspects of DevOps.


Effective Testing for Machine Learning (Part I)

#artificialintelligence

Update: Part II is out now! This blog post series describes a strategy I've developed over the last couple of years to test Machine Learning projects effectively. Given how uncertain ML projects are, this is an incremental strategy that you can adopt as your project matures; it includes test examples to give a clear idea of how these tests look in practice, and a complete project implemented with Ploomber is available on GitHub. By the end of the post, you'll be able to develop more robust ML pipelines. Testing Machine Learning projects is challenging. Training a model is a long-running task that may take hours and has a non-deterministic output, which is the opposite of what we need to test software: quick and deterministic procedures. One year ago, I published a post on testing data-intensive projects to make Continuous Integration feasible.
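One common way to tame the non-determinism the post mentions is to seed the random number generator and run a fast "smoke test" on tiny data. The trainer below is a toy, not the post's code: it fits y = w*x by stochastic gradient descent, and the seed makes repeated runs bit-for-bit reproducible.

```python
import random

def train(data, seed=0, epochs=200, lr=0.05):
    """Fit y = w*x by stochastic gradient descent. Seeding the RNG makes
    the normally non-deterministic training step reproducible, so it can
    be covered by a quick, deterministic test."""
    rng = random.Random(seed)
    w = rng.uniform(-1, 1)
    for _ in range(epochs):
        x, y = rng.choice(data)
        w -= lr * 2 * (w * x - y) * x  # gradient of squared error
    return w

# Smoke test on tiny data: fast, and deterministic because of the seed.
tiny = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w1 = train(tiny, seed=42)
w2 = train(tiny, seed=42)
```

A test can now assert both reproducibility (`w1 == w2`) and a loose quality bound, without waiting hours for a full training run.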



How Uber Implements CI/CD Of Machine Learning Models

#artificialintelligence

The ride-hailing giant Uber is currently present in 10K cities across 71 countries, and its platform is used by 93 million customers and 3.5 million drivers globally. Every quarter, the platform completes nearly 1.44 billion trips. However, as a result of the global pandemic and travel restrictions, the total number of quarterly Uber trips decreased by 24.21% in 2020. "At Uber, we have witnessed a significant increase in ML adoption across various organisations and use-cases over the last few years," said the company in its latest blog post, co-authored by Yi Zhang, Joseph Wang, Jia Li, and Yunfeng Bai. The post further highlighted various pain points and explained how Uber implemented continuous integration (CI) and continuous deployment (CD) of machine learning models to address them.


HBO Max mocked and consoled after sending odd 'integration test' email – as it blames message on intern

The Independent - Tech

HBO Max has been mocked and consoled after sending out an unusual email to its customers. The message, apparently sent to a significant number of the service's subscribers, was not advertising a new show or feature, but included only a cryptic line that appeared to have been sent out by mistake: "This template is used by integration tests only." As recipients opened the email and quickly realised that it had been sent in error, the reaction ranged from mockery to sympathy for the person who had clearly sent an internal test email out to potentially millions of subscribers. Many joked that the integration test email sounded like a show that could be on the service.


Good Software Engineering Practices for Data Scientists

#artificialintelligence

There are no hard and fast rules for how you must approach a problem or how you should implement it; however, there are certain standards. Often you will be working on a team, or contributing to an open-source project where many others work on the same program with you. Your code might even be used as production code, so there need to be standards to follow. Data scientists come from different backgrounds.
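As one small, hypothetical example of the kind of standard the post has in mind: a short, pure function with type hints and a docstring is far easier for teammates to review, reuse, and test than an inline notebook cell.

```python
def normalize(values: list[float]) -> list[float]:
    """Scale values linearly to the [0, 1] range.

    A small, pure function with type hints and a docstring: the kind of
    shared convention that keeps team and open-source code readable and
    safe to promote to production.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # Avoid division by zero when all values are identical.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Because the function has no hidden state, a one-line unit test (e.g. `normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]`) fully exercises it.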